As a city with a vibrant and culturally diverse population, Toronto’s food scene features a wide variety of cuisines from around the world, ranging from fast food to family-owned cafes and fine dining restaurants. With thousands of restaurants across the city, online ratings on platforms like Yelp can have a strong influence on the decisions of individuals, as well as the marketing and operational decision-making of the business owners. Using restaurant data from Yelp, I aim to investigate the research question: How do key factors, such as location, price, categories, authenticity and review count, contribute to the rating of a restaurant?
A study of restaurant inspection results in 2017 and 2018 in Toronto suggested that there are significant geospatial clustering patterns in restaurant quality, with hotspots of high infraction rates in North York and on the eastern shoreline (Ng et al. 2020). Moreover, a 2000 study of 63 restaurants in Toronto found that features that correlated with upscale restaurants had a positive effect on customer ratings (Susskind and Chan 2000). An Australian study identified that cuisines had differing reputations, with European cuisines associated with a higher reputation (Fogarty 2012). Additionally, a 2013 study in the United States suggested that restaurants perceived as more authentic received higher ratings, with chain restaurants viewed as less authentic than family-owned establishments (Kovács, Carroll, and Lehman 2014). These related studies motivated the inclusion of several variables in this study and suggest the following hypothesis: locations identified as infraction hotspots are expected to have lower ratings, while factors such as higher prices, European cuisine, and greater authenticity are anticipated to correlate with higher ratings.
The original dataset was extracted from the Yelp Fusion API (Yelp Inc., n.d.) and consisted of 6167 restaurants with 14 attributes, including average user rating, longitude, latitude, distance from the center of Toronto (defined as latitude 43.64547, longitude -79.42223), a list of categories, review count, and user-rated price level. The business search API was used with the location set to ‘Toronto, ON’ and the term ‘restaurants’. Because the API returns at most 240 “best match” results per query, the search was run separately for each of 134 selected categories and the results were combined into a single dataset. A neighbourhood attribute was then added through a spatial join between each restaurant's longitude and latitude and a neighbourhood dataset from Toronto Open Data (City of Toronto 2024).
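The category-by-category collection described above can be sketched as follows. This is a minimal illustration rather than the code used for the study: `fetch_page` is a hypothetical stand-in for one call to the business search endpoint (with the location and term fixed), the page size of 50 is an assumption about the per-request limit, and merging on the `id` field is an assumed way of combining results.

```python
from typing import Callable, Dict, Iterable, List

PAGE_SIZE = 50      # assumed per-request maximum for the search endpoint
MAX_RESULTS = 240   # per-query cap described in the text


def collect_businesses(
    categories: Iterable[str],
    fetch_page: Callable[[str, int, int], List[dict]],
) -> List[dict]:
    """Query each category separately and merge the results by business id.

    fetch_page(category, offset, limit) is a hypothetical helper that
    returns one page of search results as a list of dicts.
    """
    seen: Dict[str, dict] = {}
    for category in categories:
        offset = 0
        while offset < MAX_RESULTS:
            limit = min(PAGE_SIZE, MAX_RESULTS - offset)
            page = fetch_page(category, offset, limit)
            for business in page:
                # A restaurant can match several categories; keep one copy.
                seen.setdefault(business["id"], business)
            if len(page) < limit:
                break  # this category has no further results
            offset += limit
    return list(seen.values())
```

Passing the fetch function as an argument keeps the paging and deduplication logic separate from the HTTP layer, so it can be exercised with stub data before any real requests are made.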
The data was cleaned by removing erroneous and uninformative observations: restaurants with fewer than 3 reviews (so that ratings are less noisy or biased), restaurants with missing latitude or longitude (since location is important to the analysis), and restaurants outside Toronto (based on the boundaries in the neighbourhood dataset). The price level was converted into a factor with more interpretable levels, and categories were grouped into broader categories of restaurant type and cuisine (grouped by region). A chain-size variable was created from the number of restaurants in Toronto sharing the same name. After cleaning, the dataset contained 3985 restaurants and 13 variables.
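A minimal sketch of these cleaning steps is shown below, assuming each restaurant is a dictionary with `name`, `review_count`, `latitude`, `longitude`, and `neighbourhood` keys (the key names are assumptions). The cut-offs and labels in `size_level` are hypothetical, since the report does not state them; only the existence of four size levels is implied by the ANOVA degrees of freedom. Counting names after filtering is also an assumption.

```python
from collections import Counter
from typing import List


def size_level(n_locations: int) -> str:
    """Map a same-name location count to a chain-size label.

    The thresholds and labels are hypothetical placeholders.
    """
    if n_locations == 1:
        return "single"
    if n_locations <= 5:
        return "small chain"
    if n_locations <= 20:
        return "medium chain"
    return "large chain"


def clean(restaurants: List[dict], min_reviews: int = 3) -> List[dict]:
    """Drop uninformative rows, then attach a chain-size level."""
    kept = [
        r for r in restaurants
        if r["review_count"] >= min_reviews
        and r.get("latitude") is not None
        and r.get("longitude") is not None
        # A missing neighbourhood means the spatial join found no match,
        # i.e. the restaurant lies outside the Toronto boundaries.
        and r.get("neighbourhood") is not None
    ]
    # Assumption: chain size is counted on the cleaned data.
    name_counts = Counter(r["name"] for r in kept)
    return [{**r, "size_level": size_level(name_counts[r["name"]])} for r in kept]
```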
To explore the data, summary statistics were calculated over select categorical variables, and multiple plots were created: the distributions of each variable; relationships between each variable and the target variable, rating; and maps showing the spatial distribution of the variables.
First, the distributions of rating, review count, distance, size level, and price level were plotted (Figure 1). Ratings are left-skewed, while review count and distance are right-skewed. Most restaurants are single locations, and most restaurants with a known price level fall in the medium category.
Figure 1. Distribution of rating, review count, distance, size level, and price level
The distribution of the broader categories of restaurant type and cuisine in Figure 2 reveals that a large number of restaurants are uncategorized by type and cuisine. Among the categorized restaurants, bars and East Asian restaurants are the most numerous.
Figure 2. Distribution of restaurant type and cuisine
Then, to investigate the distribution of restaurants and their ratings by neighbourhood, I summarized the numeric variables by neighbourhood and plotted their spatial distributions. As shown in Table 1, which lists summary statistics for the top 10 neighbourhoods by restaurant count, the Kensington-Chinatown neighbourhood has the most restaurants, at 297, followed by the Yonge-Bay Corridor and Wellington Place neighbourhoods. The map in Figure 3 suggests that restaurants are concentrated in downtown neighbourhoods.
| Neighbourhood | Number of Restaurants | Average Rating | Average Review Count |
|---|---|---|---|
| Kensington-Chinatown | 297 | 3.811 | 103.084 |
| Yonge-Bay Corridor | 265 | 3.532 | 93.147 |
| Wellington Place | 220 | 3.753 | 135.123 |
| Annex | 159 | 3.761 | 95.101 |
| Trinity-Bellwoods | 156 | 3.990 | 88.449 |
| Downtown Yonge East | 126 | 3.752 | 126.921 |
| St Lawrence-East Bayfront-The Islands | 94 | 3.687 | 101.330 |
| South Riverdale | 92 | 3.987 | 73.815 |
| University | 75 | 3.776 | 63.600 |
| Bay-Cloverhill | 72 | 3.739 | 82.597 |
Figure 3. Distribution of restaurants by neighbourhood
Additionally, Table 2 lists the summary statistics for the top 10 neighbourhoods (with more than 5 restaurants) by average rating, with Alderwood leading at 4.440, followed by Weston and Broadview North. The mapped distribution of average ratings in Figure 4 shows no clear spatial clustering of ratings. Comparing the two tables and maps, the neighbourhoods with the highest average ratings tend to have fewer restaurants and lower average review counts.
| Neighbourhood | Number of Restaurants | Average Rating | Average Review Count |
|---|---|---|---|
| Alderwood | 10 | 4.440 | 20.300 |
| Weston | 12 | 4.283 | 34.083 |
| Broadview North | 6 | 4.200 | 24.500 |
| Blake-Jones | 13 | 4.138 | 46.231 |
| Regent Park | 11 | 4.109 | 10.818 |
| Danforth | 32 | 4.069 | 46.344 |
| Englemount-Lawrence | 9 | 4.067 | 28.889 |
| Dovercourt Village | 38 | 4.066 | 31.289 |
| Yonge-Doris | 51 | 4.045 | 72.137 |
| Woodbine Corridor | 9 | 4.044 | 46.889 |
Figure 4. Distribution of average ratings by neighbourhood
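Tables 1 and 2 are the same per-neighbourhood aggregate ranked on different columns. A minimal sketch of that aggregation, assuming each restaurant is a dictionary with `neighbourhood`, `rating`, and `review_count` keys (the key and function names are assumptions for illustration), is:

```python
from collections import defaultdict
from statistics import mean
from typing import List


def summarise_by_neighbourhood(
    restaurants: List[dict],
    sort_by: str = "n_restaurants",
    min_restaurants: int = 1,
    top_n: int = 10,
) -> List[dict]:
    """Count restaurants and average rating/review count per neighbourhood."""
    groups = defaultdict(list)
    for r in restaurants:
        groups[r["neighbourhood"]].append(r)

    rows = [
        {
            "neighbourhood": name,
            "n_restaurants": len(members),
            "avg_rating": round(mean(m["rating"] for m in members), 3),
            "avg_review_count": round(mean(m["review_count"] for m in members), 3),
        }
        for name, members in groups.items()
        if len(members) >= min_restaurants
    ]
    # Largest values first, then keep the top_n rows.
    rows.sort(key=lambda row: row[sort_by], reverse=True)
    return rows[:top_n]
```

Under these assumptions, a ranking like Table 1 corresponds to the default arguments, while a ranking like Table 2 corresponds to `sort_by="avg_rating"` with `min_restaurants=6` (more than 5 restaurants).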
A one-way analysis of variance (ANOVA) on restaurant ratings by neighbourhood yields a p-value of approximately 0.000 (Table 3), indicating that the mean rating of at least one neighbourhood differs significantly from the others.
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| neighbourhood | 143 | 158.753 | 1.110 | 2.355 | 0 |
| Residuals | 3841 | 1810.570 | 0.471 | NA | NA |
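To make the columns of the ANOVA table concrete, the sketch below computes the same quantities (degrees of freedom, sums of squares, mean squares, and the F value) from raw groups of ratings. It is an illustrative plain-Python version, not the routine used to produce the table; the function name and return keys are assumptions, and it stops at the F statistic because the p-value requires the F distribution, which a statistics library would normally supply.

```python
from typing import Dict, Sequence


def one_way_anova(groups: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Return the entries of a one-way ANOVA table for the given groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group variation: group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group variation: observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "df_between": df_between,
        "df_within": df_within,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f_value": ms_between / ms_within,
    }
```

The same arithmetic can be checked against Table 3: 158.753 / 143 ≈ 1.110 and 1810.570 / 3841 ≈ 0.471, and their ratio is approximately 2.355, matching the reported F value.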
I also investigate potential categorical effects on ratings for the other categorical variables: size level, price level, restaurant type and cuisine type.
Figure 5. Distribution of restaurant ratings by size level and price level
The boxplot of restaurant ratings by size level in Figure 5 shows that median rating decreases as chain size increases, so restaurants belonging to smaller chains tend to be rated higher. All size levels have similar spreads of ratings. A one-way ANOVA test, with results shown in Table 4 (p-value < 0.001), suggests that the mean rating of at least one size level differs significantly from the others.
Similarly, the boxplot of ratings by price level, also displayed in Figure 5, suggests that the rating of a restaurant increases with price level, though restaurants with an unknown (NA) price level have the highest median rating. The spread of ratings also decreases with price level. The one-way ANOVA test in Table 5 (p-value < 0.001) also indicates that the mean rating of at least one price level differs significantly from the others.
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| size_level | 3 | 209.355 | 69.785 | 157.852 | 0 |
| Residuals | 3981 | 1759.968 | 0.442 | NA | NA |
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| price_level | 3 | 9.757 | 3.252 | 8.08 | 0 |
| Residuals | 2107 | 848.072 | 0.403 | NA | NA |
Figure 6. Distribution of restaurant ratings by restaurant type
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| type | 12 | 94.221 | 7.852 | 16.632 | 0 |
| Residuals | 3972 | 1875.103 | 0.472 | NA | NA |
Figure 7. Distribution of restaurant ratings by cuisine
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| cuisine | 9 | 55.279 | 6.142 | 12.756 | 0 |
| Residuals | 3975 | 1914.044 | 0.482 | NA | NA |
Figure 8. Pearson correlation matrix for numerical variables: review count, distance, and rating of restaurants
Figure 9. Review Count & Distance vs Rating
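For reference, the Pearson coefficient behind a matrix like the one in Figure 8 is the covariance of two variables divided by the product of their standard deviations. The sketch below is a plain-Python illustration of that definition; the function names and the example column names are assumptions, not the code used to produce the figure.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(
    columns: Dict[str, Sequence[float]]
) -> Dict[str, Dict[str, float]]:
    """Pairwise Pearson correlations between named numeric columns."""
    names = list(columns)
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in names}
        for a in names
    }
```

Calling `correlation_matrix` with columns such as `rating`, `review_count`, and `distance` would give the nested dictionary that a heatmap like Figure 8 visualizes.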
The preliminary results suggest that the variables are strongly skewed or imbalanced. Moreover, although no clear spatial clustering of ratings by neighbourhood is observed, the ANOVA test suggests statistically significant differences in ratings across neighbourhoods. Smaller restaurant chains and higher price levels correspond to higher ratings, though restaurants with unknown price levels rank the highest. Bistros, cafés and bakeries, and international and Middle Eastern cuisines receive the highest ratings, while diners, fast food, and North American cuisine rank the lowest. ANOVA tests indicate significant differences in ratings across all categorical variables. Numeric variables show weak negative correlations with ratings. Overall, restaurants from smaller chains, pricier restaurants, and certain restaurant types and cuisines tend to have higher ratings, while location and numeric factors play a limited role.
Next, I will perform predictive modelling with regression to quantify the impact of these factors on restaurant rating. I will start by fitting a linear regression model with all of the predictors as a baseline, and then refine it using all-possible-subsets selection to choose a final linear regression model. I will then fit tree-based regression models, including a regression tree, a random forest, and XGBoost, tuning the hyperparameters of each. The performance of each model will be evaluated with R-squared and RMSE, and the results will be compared across all models. I will also use the models to determine variable importance. These steps will provide insight into the research question.
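The two evaluation metrics named above can be written down directly from their definitions. The following is a minimal sketch with assumed function names, intended only to fix the definitions; it is independent of whichever modelling library is eventually used.

```python
from math import sqrt
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: typical size of a prediction error."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A model that always predicts the mean rating has an R-squared of 0, so that value is a natural floor against which the baseline linear model and the tree-based models can be compared. RMSE is expressed in rating units, which makes it easier to interpret on the star scale.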